tidyverse/dplyr and data.tableAbout this document:
This is an R Markdown Notebook. When opening it in R, it becomes interactive, and you can execute all of the code within this notebook You can then reproduce the results of this document. When viewed in a browser, you can explore the output (e.g., scroll through tables).
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. Important: many chunks depend on each other - i.e., you can only run later chunks if you have ran the chunks above. So when you get an error message, try to run previous chunks first.
Learning objectives:
Mandatory for everybody
Being able to explain, in words (not in programming code), what types of different operations there are, and how they change their input data.
Example of a multiple-choice question you may encounter an a test:
Question: “Mike has a dataset with primary key date and track ID for 100 tracks. Because the data is available at the daily level for one year, that’s quite a number of rows (365 x 100 = 36,500), and difficult to read. Can you suggest Mike a data operation that would make this data more accessible for a decision-maker interested in retaining the daily data?”
Answer: “The data is stored in long format. You can convert it to wide format. Ideally, the primary key of that data will be date, and the columns will become track_id. The new data set then is only 365 rows long and 100 columns wide.”
Optional for those who wish to enhance their R skills
dplyr/tidyverse (which has a very “clean” and easy-to-use syntax, but is slow on big data sets), and data.table (which is a bit messier to write up, but shows extremely good performance on large datasets)pandas? You can fork the originating GitHub repository, commit your changes, and then issue a pull request to us & we’re more than happy to integrate/update!Prerequisites:
install the following packages: install.packages(c('RMySQL','data.table','tidyverse', 'dplyr'))
Let’s first load the packages so that R knows how to “handle” MySQL, etc.
library(RMySQL) # loads the MySQL package
library(data.table) # loads the data.table package
library(dplyr) # loads the dplyr library
library(tidyr) # loads the tidyr library
Let us now also load some data from the classes’ MySQL server into R. Recall that the database holds a copy of daily data on the Top 200 Charts, as shown on https://spotifycharts.com.
database = dbConnect(MySQL(), user = 'student', password='rsm!2020',
dbname='rsm', host = 'sql.tilburg-digital.com')
dbListTables(database) # shows a list with tables in the database
## [1] "artists" "echonest" "plays" "tracks"
Our database consists of the four tables listed above. Now, let us explore these tables a bit. Writing the R code to run a particular query will take you a couple of lines - every time you wish to run code! That’s why we first write (from scratch) a couple of functions that will ease data retrieval later on. Don’t worry too much about understanding these functions at this point - just load them.
# Let's define a function to load some sample data from the SQL database server (the first X rows)
get_sample_data = function(tablename) {
rs = suppressWarnings(dbSendQuery(database, paste0("SELECT * FROM ", tablename, " LIMIT 1000")))
data = fetch(rs, n = -1)
}
# Let's define a function that will ease running queries on our MySQL database.
run_query = function(query, n = -1) {
rs = suppressWarnings(dbSendQuery(database, query)) # runs the query
data = fetch(rs, n = n) # fetches n rows of the result (here: -1 indicates to fetch ALL rows)
return(data) # returns the result back to the main program
}
We can now use these functions to run some simple queries on our data, e.g., to take a look at three of the tables. By default, the get_sample_data function loads 1000 rows from the SQL database - we do this to save some bandwith.
artists <- get_sample_data('artists')
artists
tracks<-get_sample_data('tracks')
tracks
plays<-get_sample_data('plays')
plays
Let us also store a copy of the data in actual variables in R, so that we can use them with dplyr and data.table.
# To use the data with data.table, let us convert them to "data.table".
dt_artists = data.table(artists)
dt_tracks = data.table(tracks)
dt_plays = data.table(plays)
# To use the data with dplyr, let's convert them to a so-called "tibble":
df_artists= tbl_df(artists)
df_tracks = tbl_df(tracks)
df_plays = tbl_df(plays)
The first set of operations we’re going to review are “filters”, which sometimes are also referred to as querying (“searching”), or subsetting your data (“showing only a part of what is there”).
There are two types of filters: filtering for particular rows, and filtering for particular columns.
Let’s start querying for some observations in rows. For example, we could “filter” for all artists that have the name Adele.
SQL
run_query('SELECT * FROM artists WHERE artists.name = "Adele"')
dplyr
To use the dplyr syntax, we have to run our commands on the df_ tables saved above.
filter(df_artists, name == 'Adele')
data.table
To use the data.table syntax, we have to run our commands on the dt_ tables savbed above.
dt_artists[name == 'Adele']
Let’s now proceed with “filtering” for particular columns. For example, we can only show the “name” column from the artist table. In our SQL queries, we now use the LIMIT option to only retrieve a few rows at a time (which saves us some bandwith).
SQL
run_query('SELECT name FROM artists LIMIT 100')
dplyr
select(df_artists, name) # selects everything
data.table
dt_artists[, c('name')] # selects everything
The first set of operations we’re going to review are “filters”, which sometimes are also referred to as querying (“searching”), or subsetting your data (“showing only a part of what is there”).
There are two types of filters: filtering for particular rows, and filtering for particular columns.
Let’s start querying for some observations in rows. For example, we could “filter” for all artists that have the name Adele.
SQL
run_query('SELECT * FROM artists WHERE artists.name = "Adele"')
dplyr
To use the dplyr syntax, we have to run our commands on the df_ tables saved above.
filter(df_artists, name == 'Adele')
data.table
To use the data.table syntax, we have to run our commands on the dt_ tables savbed above.
dt_artists[name == 'Adele']
Sometimes, it’s super useful to order your data (or the results of some query). For example, let’s sort the total plays in Argentinaß before May 2015.
SQL
run_query('SELECT * FROM plays WHERE date < "2015-05-01" AND country = "ar" ORDER BY streams ASC') # from lowest to highest streams
#run_query('SELECT * FROM plays WHERE country = "NL" ORDER BY streams DESC') # from highest to lowest streams
dplyr
df_plays %>% filter(country=='ar' & date < as.Date('2015-05-01')) %>% arrange(streams) # ascending order
df_plays %>% filter(country=='ar' & date < as.Date('2015-05-01')) %>% arrange(desc(streams)) # descending order
data.table
setorderv(dt_plays, 'streams') # ascending
dt_plays[country=='ar' & date <= as.Date('2015-05-01')]
setorderv(dt_plays, 'streams', order = -1L) # descending
dt_plays[country=='ar' & date <= as.Date('2015-05-01')]
Probably one of the most important operations on data - even before “reshaping” it - is to aggregate/summarize data. Typically, aggregations are used to go from “one unit of analysis to another”.
Think about our plays table, for example. The primary key of that table is date-country-position (each row can be uniquely identified by the combination of a date, country, and position).
sumSuppose now we’d like to “summarize” the data by day - e.g., how many plays in total were there on a given day?
SQL
run_query('SELECT date, SUM(streams) as total_plays FROM plays GROUP BY date LIMIT 10;')
dplyr
df_plays %>% group_by(date) %>% summarise(total_plays = sum(streams))
Note that the result looks a little different. That’s because SQL is using ALL data from our database, whereas the df_plays data only holds a sample of the data.
data.table
dt_plays[, list(total_plays = sum(streams)), by = c('date')]
Of course, there are multiple operations available to summarize date: for example, simply counting rows, counting the number of unique values, or taking the minimum or maximum within a particular group.
Here are a few examples:
SQL
run_query('SELECT COUNT(*) FROM tracks') # how many tracks are listed in the database?
run_query('SELECT COUNT(DISTINCT name) from artists') # how many artists are listed in the database?
run_query('SELECT MIN(streams) FROM plays WHERE country = "ar"') # what were the minimum streams for a song to enter the Argentenian charts?
run_query('SELECT AVG(streams) FROM plays WHERE country = "ar"') # what is the average number of streams for a song in the Argentenian Spotify Top 200?
dplyr
summarise(df_tracks, total_tracks = n())
summarise(df_artists, total_artists = n_distinct(name))
df_plays %>% filter(country == 'ar') %>% summarise(min_streams= min(streams))
df_plays %>% filter(country == 'ar') %>% summarise(avg_streams= mean(streams))
Many times, we’re in need of creating “new” variables from old variables.
For example, this way, we can create a dummy variable, indicating whether a given chart position was in the Top 10 (=1), or not (=0).
SQL
run_query('SELECT country, date, position, streams, (position <= 10) AS top10 FROM plays WHERE country = "ar"')
dplyr
df_plays %>% mutate(top10 = position<=10)
data.table
dt_plays[, top10 := ifelse(position<=10, TRUE, FALSE)]
Creating continuous variables from scratch in SQL is quite complex; also, you loose a lot of flexibility the R gives you in using extra mathematical functions (after all, R is a statistical language, while SQL is “only” a database query language). Therefore, we give some examples in dplyr and data.table: here, taking the natural logarithm of a variable.
dplyr
df_plays2 = df_plays %>% mutate(ln_streams = log(streams))
df_plays2
data.table
dt_plays2 = dt_plays[, ln_streams := log(streams)]
dt_plays2
Deleting variables in SQL doesn’t make sense (unless you’re a database manager and would like to seriously change the structure of the database). Below, we show you how you can delete variables in dplyr and data.table.
dplyr
df_plays2 %>% select(-ln_streams)
data.table
dt_plays2[, ln_streams := NULL]
dt_plays2
In wrangling with data, you frequently need to change the format of your data to “fit” the input requirements of some analysis methods.
Prepare some example data
For this example, let us retrieve some data the number of plays for Adele and Drake across countries. Let’s not worry too much about the query to retrieve that data (it’s quite complicated). Instead, we focus on the reshaping operations later.
bycountry = run_query("SELECT artists.name as name, country, SUM(streams) as total_plays FROM plays LEFT JOIN tracks ON tracks.id = plays.track_id LEFT JOIN artists ON artists.id = tracks.artist_id WHERE artists.name IN ('Adele', 'Drake') GROUP BY artists.name, plays.country")
# let us also prepare this data for dplyr and data.table
df_bycountry = tbl_df(bycountry)
dt_bycountry = data.table(bycountry)
bycountry # the head() and tail() commands would only show you the top (bottom) six rows
Explore the data a bit (e.g., on the first page, on the last page). What do you see?
You probably agree it’s quite length to scroll through the table, as the table is very long. That’s why the format of this table is called “long” format: we use many rows to store the result of Adele and Drake, for each country. The primary key of this long-shaped data set is name and country, as every row can be uniquely pointed to by the combination of these column names.
Now let’s suppose we wanted to prepare an overview for a music manager on the performance of these two artists across all countries. Further, let’s suppose the manager is interested in learning in which country any of these two artists wasn’t making the Top 200 at all.
You probably agree that the long-shaped format isn’t very well suited for this - or would you want the manager to scroll through dozens of result pages? Right. Therefore, to prepare a good overview, we convert the data from its “long” format to the so-called “wide” format (which basically means: many columns).
Watch this! (Note we’re going to convert the result from above to tibbles and data.tables first)
dyplr
result1 = spread(df_bycountry, country, total_plays)
result1
data.table
result2 = dcast(dt_bycountry, name~country, value.var='total_plays')
result2
Sometimes, we also just wish to go back from wide to long. Let’s use the results from above as input to convert those back to the long format.
dplyr
result1b = gather(result1, 'country', 'total_plays', -name)
result1b
data.table
result2b = melt(result2, c('name'), variable.name='country', value.name = 'total_plays')
result2b
We can use code to unite columns into one. For example, let’s create an overview about the Top 3 countries in terms of plays for each of the artists.
Let’s first create an indicator variable for the Top 3, filter on it.
# we use data.table for the conversion
setorderv(result2b, c('name', 'total_plays'), order = c(1, -1), na.last=T)
# sort in ascending order by name, and descending order by
result2b[, rank := 1:.N, by = c('name')] # create rank positions
# filter for ranks <= 3
result2b = result2b[rank <= 3]
result2b
Now we can combine the resulting country names with their corresponding streams in one column.
dplyr
result3a <- unite(result2b, 'country_streams', country, total_plays, sep = ' ')
result3a
data.table
# let us directly create the concatenated variable in the data.table
result3b = result2b
result3b[, country_streams := paste(country, total_plays)]
result3b
Now let’s suppose we only see the combined (comma-separated) values in the country_streams column in the example above. How would we separate those values back into individual columns?
Let us first “delete” the individual country/streams columns from the data.
tmp=result3b
tmp=tmp[, ':=' (country=NULL, total_plays=NULL)] # let's first remove the desired result
tmp
Watch how here:
dplyr
result4a = separate(tmp, country_streams, c('country','streams'), sep =' ')
result4a
data.table
# now let's extract the country names from the combined column
tmp[, country:=sapply(strsplit(country_streams, ' '), function(x) x[1])]
# now let's extract the streams from the combined column, converting them back to numerics
tmp[, total_plays:=as.numeric(sapply(strsplit(country_streams, ' '), function(x) x[2]))]
result4b = tmp
result4b
Combining data sets is an essential component of preparing a data for analysis.
The “most” simple combination is a row bind. Suppose you have data set A with 100 rows, and data set B with 150 rows (and both of these data sets have exactly the same column names), then you can bind these data sets together in the same way you would glue two books together. The resulting book has 250 pages.
Let’s first prepare two separate data sets (we just take two different rows from the artist dataset.
datasetA = df_artists %>% filter(name=='Adele')
datasetB = df_artists %>% filter(name=='AC/DC')
dplry
bind_rows(datasetA, datasetB)
data.table
rbindlist(list(datasetA, datasetB))
In simple terms, merging (or joining) dataset A with dataset B means that you add columns from one dataset to the other dataset, given some corresponding key columns in both data sets.
Suppose you have a dataset A with artist names, and a dataset T with track names, and both of these data sets have an ID column for artists, then you can “add” the artist names to the track data set using a join. Specifically, a LEFT JOIN.
Important
To be able to merge datasets, they need to be at the same (or higher) level of aggregation. For example, if you have data set A which measures a playlist followers in weeks, and another data set B that measures the content of the playlists every day, then you can’t merge these two datasets directly.
However, you could do the following:
Left joins add the columns of dataset B to dataset A, and leaves the structure of dataset A intact.
SQL
run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM tracks LEFT JOIN artists ON tracks.artist_id = artists.id')
dplyr
left_join(tracks, artists, by = c('artist_id' = 'id'))
data.table
merge(tracks, artists, by.x='artist_id', by.y='id', all.x=T)
Right joins work the other way around: it keeps all data in the right table (dataset B), and joins available data from the left table (dataset A)
SQL
run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM artists RIGHT JOIN tracks ON tracks.artist_id = artists.id')
dplyr
right_join(artists, tracks, by = c('id' = 'artist_id'))
data.table
merge(artists, tracks, by.x='id', by.y='artist_id', all.y=T)
Inner joins keeps only rows in both data sets.
SQL
run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM artists INNER JOIN tracks ON tracks.artist_id = artists.id')
dplyr
inner_join(tracks, artists, by = c('artist_id' = 'id'))
data.table
merge(tracks, artists, by.x='artist_id', by.y = 'id')
Here, we discuss a few “special” joins.
Let’s take a look at the plays for the track “El Perdon”. Notice the track has mostly been playing in Argentina and Australia, sometime starting in 2015.
run_query("SELECT * from plays WHERE track_id = '4Ld3RS6KBXGsvtfmnjxlVl'")
Suppose now you’d like to prepare a panel data set in which for each day in 2015 and 2016, you’d see the number of streams for that particular song in Argentina and Australia. Merely using the SQL query above isn’t enough - two years have 365*2 days, and the resulting rows above only list data on two particular dates in a year.
What we then require first is an “empty” dataset, only listing the country, and date values for each day in 2015 and 2016.
Let’s create that data first:
countries = c('ar', 'au')
dates = seq(from=as.Date('2015-01-01'), to = as.Date('2016-12-31'), by = '1 day')
tmp = lapply(countries, function(x) data.table(country=x, date=dates))
keydata = rbindlist(tmp)
keydata
Browse this “empty” data set above. Can you confirm it lists all dates in 2015 and 2016, for both countries?
We can now go ahead and merge our song data, using a LEFT JOIN (we want to keep all rows in A, and only merge the rows to it when the song was actually in the Top 200).
dplyr
# Version 1
tmp = df_plays %>% filter(track_id == '4Ld3RS6KBXGsvtfmnjxlVl') %>% mutate(date=as.Date(date))
left_join(keydata, tmp) %>% select(country, date, streams) %>% replace_na(list(streams=0))
## Joining, by = c("country", "date")
# Version 2 is even more elegant/readable
tmp %>% complete(date= seq.Date(from = as.Date('2015-01-01'), to = as.Date('2016-12-31'), by = '1 day')) %>% select(country, date, streams)
Scroll down to about April/May 2015: see those streams there?
data.table
merged_data = merge(keydata, tmp, by = c('country', 'date'), all.x=T)
merged_data[is.na(streams), streams := 0] # replace NAs with 0s
merged_data[, c('country', 'date','streams'),with=F] # only show selected columns
Alright - we’re done “wrangling” with our data. It’s time to save it.
Saving your data in CSV files has several advantages: - program independence: you can open the data in any software (e.g., R, Python, SPSS, …) - the data is likely to be accessible in the future: CSV files have a very simple structure that you can even see while merely looking at it in a text editor.
The disadvantages are: - If you’re working with special encodings (e.g., special characters), CSV files many times break when you try to open them in other software. For example, if you open a CSV file in Excel and save it again, it’s format will be changed and R may encounter difficulties reading it in.
Here, we by all means prefer data.table due to its unprecedented speed in writing data to the disk.
data.table
# Let's compare a regular write (100 times)...
system.time({for (i in 1:100) write.table(plays, 'plays.csv')})
## user system elapsed
## 0.484 0.031 0.228
# To 100 times a data.table fwrite
system.time({for (i in 1:100) fwrite(plays, 'plays.csv')})
## user system elapsed
## 0.026 0.012 0.038
# It's save to assume fwrite always beats R's own write.table.
# So, let's write it to the disk.
fwrite(plays, 'plays.csv')
new_data <- fread('plays.csv')
Sometimes it’s also convenient to store data directly in the RData format. You can also save more than one object in a file. The advantage is you can read in all data at the same time, and the data is also automatically zipped. The disadvantage is that you need R to read in that data (so no independence!), and it may take a while to load in the data (way longer than reading in three CSVs!).
save(plays, artists, tracks, file = 'my_data.RData')
Let’s now read in our saved data.
my_plays <- fread('plays.csv')
head(my_plays)
# remove our existing data first
rm(plays, artists, tracks)
load('my_data.RData')
head(plays)
head(artists)
head(tracks)